Topic Modelling & Image Classification

Content¶

Text¶

  • Importing libraries
  • Getting data from Yelp API
  • Preprocessing text data
    • Checking for null values
    • WordCloud for visualizing the word frequency before cleaning
    • Punctuation removal
    • Lowercasing text
    • Removing languages other than English
    • Stop Word removal
    • Lemmatization, Tokenization and removal of neutral words
    • Rechecking the word frequency by plotting wordcloud
    • Creating Bi/tri grams
    • Creating Bag of words
    • Creating TF-IDF
    • Word embedding using Fasttext
  • Sentiment analysis (Textblob, VADER, Flair)
  • PCA
  • t-SNE
  • Clustering for topics and visualization
  • LDA (Gensim and sklearn) for topics
  • NMF (Gensim and sklearn) for topics

Importing libraries

In [6]:
import numpy as np
import pandas as pd
import random
from sys import getsizeof
import gql
import os
import matplotlib.pyplot as plt
import seaborn as sns
import shutil

import warnings
warnings.filterwarnings('ignore')

Fetching data from Yelp API¶

Using the Yelp GraphQL API to load 3 reviews and 1 photo for each of 50 restaurants in 20 different locations:

  • 1000 restaurants
  • 1000 photos
  • 3000 comments
In [7]:
from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport
import json
locations = ['San Francisco', 'New York', 'Seattle', 'Philadelphia', 'Houston', 'Chicago', 'Denver', 'San Diego', 'Phoenix', 'San Antonio',
                'Nashville', 'Los Angeles', 'San Jose', 'Indianapolis', 'Fort Worth', 'Oklahoma City', 'Miami', 'Boston', 'Austin', 'Portland']

# read the API key from an environment variable (hypothetical variable name) instead of hard-coding the secret in the notebook
header = {'Authorization': 'bearer {}'.format(os.environ['YELP_API_KEY']),
          'Content-Type': "application/json"}

# Select your transport with a defined url endpoint
transport = AIOHTTPTransport(url='https://api.yelp.com/v3/graphql', headers=header)

# Create a GraphQL client using the defined transport
client = Client(transport=transport, fetch_schema_from_transport=True)

if os.path.isdir("data"):
   shutil.rmtree('data')

os.mkdir("data")
listresultJson = []
# Provide a GraphQL query
# Execute the query on the transport
for index,l in enumerate(locations):
  result = await client.execute_async(gql(
  '''{search(location: "'''+l+'''", limit:50) {
            business {
                name
                photos
                price
                review_count
            reviews {
                text
                rating
                time_created
            }
            location {
                    city
                    state
                    postal_code
                    country
                }
             categories {
                    alias
                    parent_categories {
                        alias
                    }
        }
        }
    }
    }
    '''
  ))

  listresultJson.append(result['search']['business'])
In [8]:
# extracting the fetched business records
cList = []
for l in listresultJson:
    for x in l:
        cList.append(x)
    
with open('data/data.json', 'w') as output_filec:
    json.dump(cList, output_filec)
In [9]:
with open('data/data.json') as sam:
    d = json.load(sam)

# Converting the data into a pandas dataframe and saving the csv file
data = pd.json_normalize(d)
data.to_csv("data/data.csv", sep='\t')
data
Out[9]:
name photos price review_count reviews categories location.city location.state location.postal_code location.country
0 Fog Harbor Fish House [https://s3-media2.fl.yelpcdn.com/bphoto/by8Hh... $$ 9992 [{'text': 'Enjoyed celebrating my bday with my... [{'alias': 'seafood', 'parent_categories': [{'... San Francisco CA 94133 US
1 House of Prime Rib [https://s3-media4.fl.yelpcdn.com/bphoto/HLrja... $$$ 8875 [{'text': 'Never disappoint! Great food and ve... [{'alias': 'tradamerican', 'parent_categories'... San Francisco CA 94109 US
2 Kokkari Estiatorio [https://s3-media2.fl.yelpcdn.com/bphoto/FTQfP... $$$ 5188 [{'text': 'Exceptional food all around, from t... [{'alias': 'greek', 'parent_categories': [{'al... San Francisco CA 94111 US
3 Marufuku Ramen [https://s3-media4.fl.yelpcdn.com/bphoto/ouK2V... $$ 4919 [{'text': 'Very nice restaurant. Good ambiance... [{'alias': 'ramen', 'parent_categories': [{'al... San Francisco CA 94115 US
4 Gary Danko [https://s3-media1.fl.yelpcdn.com/bphoto/Rqsfo... $$$$ 5927 [{'text': 'Gary Danko is our favorite SF resta... [{'alias': 'newamerican', 'parent_categories':... San Francisco CA 94109 US
... ... ... ... ... ... ... ... ... ... ...
995 Pip's Original Doughnuts & Chai [https://s3-media2.fl.yelpcdn.com/bphoto/vZljJ... $ 3070 [{'text': 'There was quite the line on a Satur... [{'alias': 'coffee', 'parent_categories': [{'a... Portland OR 97213 US
996 Ava Gene's [https://s3-media3.fl.yelpcdn.com/bphoto/sckkK... $$$ 752 [{'text': 'I can't say enough about what an in... [{'alias': 'newamerican', 'parent_categories':... Portland OR 97202 US
997 Farmhouse Kitchen Thai Cuisine [https://s3-media2.fl.yelpcdn.com/bphoto/egThi... $$ 557 [{'text': 'You got a craving for beef noodle s... [{'alias': 'thai', 'parent_categories': [{'ali... Portland OR 97209 US
998 Gilda's Italian Restaurant [https://s3-media2.fl.yelpcdn.com/bphoto/QL2FW... $$ 618 [{'text': 'I had almost completely forgotten ... [{'alias': 'italian', 'parent_categories': [{'... Portland OR 97205 US
999 Bluefin Tuna & Sushi [https://s3-media4.fl.yelpcdn.com/bphoto/SkMBs... $$$ 259 [{'text': 'I knew I wanted to eat sushi in Por... [{'alias': 'sushi', 'parent_categories': [{'al... Portland OR 97232 US

1000 rows × 10 columns

Getting Data¶

Using the JSON reviews dataset provided by Yelp for sentiment analysis, taking 5000 random reviews from it.

In [10]:
# json file for reviews
reviewFile = "/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json"

# assigning data types to the feature for memory optimization
features = {
    "review_id": str,
    "user_id": str,
    "business_id": str,
    "stars": 'int8',
    "useful": 'int8',
    "funny": 'int8',
    "cool": 'int8',
    "text": str,
    "date": "datetime64[ns]",
}

chunks = []  # Initialize an empty list to store chunks
with pd.read_json(reviewFile, dtype=features, chunksize=100000, lines=True) as jsonReader:
    for chunk in jsonReader:
        chunks.append(chunk)  # Append each chunk to the list

reviewData = pd.concat(chunks, ignore_index=True)  # Concatenate all chunks into a single DataFrame
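The `int8` dtypes assigned above are the memory optimization at work; a minimal sketch of the saving, using a synthetic column rather than the Yelp data:

```python
import numpy as np
import pandas as pd

# hypothetical vote-count column: int64 by default, int8 after downcasting
votes64 = pd.Series(np.zeros(100_000, dtype='int64'))
votes8 = votes64.astype('int8')

# int8 stores one byte per value instead of eight
print(votes64.memory_usage(deep=True), votes8.memory_usage(deep=True))
```

Columns such as `useful`, `funny` and `cool` fit comfortably in `int8`, so declaring the dtype up front avoids the default 8-byte integers across millions of rows.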
In [11]:
reviewData  = reviewData.sample(5000, random_state=42)
reviewData
Out[11]:
review_id user_id business_id stars useful funny cool text date
1295256 J5Q1gH4ACCj6CtQG7Yom7g 56gL9KEJNHiSDUoyjk2o3Q 8yR12PNSMo6FBYx1u5KPlw 2 1 0 0 Went for lunch and found that my burger was me... 2018-04-04 21:09:53
3297618 HlXP79ecTquSVXmjM10QxQ bAt9OUFX9ZRgGLCXG22UmA pBNucviUkNsiqhJv5IFpjg 5 0 0 0 I needed a new tires for my wife's car. They h... 2020-05-24 12:22:14
1217795 JBBULrjyGx6vHto2osk_CQ NRHPcLq2vGWqgqwVugSgnQ 8sf9kv6O4GgEb0j1o22N1g 5 0 0 0 Jim Woltman who works at Goleta Honda is 5 sta... 2019-02-14 03:47:48
3730348 U9-43s8YUl6GWBFCpxUGEw PAxc0qpqt5c2kA0rjDFFAg XwepyB7KjJ-XGJf0vKc6Vg 4 0 0 0 Been here a few times to get some shrimp. The... 2013-04-27 01:55:49
1826590 8T8EGa_4Cj12M6w8vRgUsQ BqPR1Dp5Rb_QYs9_fz9RiA prm5wvpp0OHJBlrvTj9uOg 5 0 0 0 This is one fantastic place to eat whether you... 2019-05-15 18:29:25
... ... ... ... ... ... ... ... ... ...
5884448 bXXRzBg7DuGnY8ij4INBWg 9fP3KiiVpFVYcnqgD3aZJw iaBU5h_j0TCrUFzTbjFIlw 3 9 0 0 I am not sure what to think of this place. I b... 2013-04-09 22:29:48
6745875 FkekUQC8z63ywSFQnK4Z4w JLW2uULP_Q1KXHhToNljcQ jMStvE-tQzSpRCAO0nAE6g 3 5 2 8 I'm so excited to see the red Robin had re-ope... 2018-09-27 23:47:13
5730804 4IzbwfjgwUq1gXKA97Erwg lESGYBwhs9ZtpWeJf_2Zig hGCETx03FN8Qtx1T8StHaQ 5 0 0 0 This is our go-to pizza place! We love their ... 2018-09-05 23:00:37
1995249 23xRe5HtAsPlHyUuM7AFTQ 5pgl40PSrB-dTbEg-eWIFA ecapYwbEvmKHKAfsGA4tow 4 3 0 0 This is located in a great spot fairly close t... 2014-02-13 22:54:43
6544963 vLxH2ifmZw8htzm_WZCGVw W0DJOPsSwcAj0uqCJG8iLw aGOXuqO6yhN66tLYI61Thg 2 1 0 0 I went in for a sirloin burger and a salad. Th... 2015-05-08 02:42:30

5000 rows × 9 columns

Preprocessing text data¶

Checking for null values¶

In [12]:
reviewData.isnull().any()
Out[12]:
review_id      False
user_id        False
business_id    False
stars          False
useful         False
funny          False
cool           False
text           False
date           False
dtype: bool

Generating word cloud for visualization¶

In [13]:
from wordcloud import WordCloud
plt.figure(figsize=(20,10))
#Creating the text variable from the full review texts
textWordCloudBefore = " ".join(cat for cat in reviewData.text)
# Creating word_cloud with text as argument in .generate() method
word_cloud = WordCloud(collocations = False, background_color = 'white', width=2000, height=1000).generate(textWordCloudBefore)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Punctuation Removal¶

Removing punctuation: full stops, question marks, commas, colons, semi-colons, exclamation marks and quotation marks.

In [14]:
import string
string.punctuation

processedReviewData = reviewData.copy()
processedReviewData.reset_index(drop=True, inplace=True)

#defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    punctuationfree="".join([i for i in punctuationfree if i not in ['\n', '\t', '\b']])
    return punctuationfree
#storing the punctuation-free text
processedReviewData['text_punct_reml']= processedReviewData['text'].apply(lambda x:remove_punctuation(x))
processedReviewData['text_punct_reml']
Out[14]:
0       Went for lunch and found that my burger was me...
1       I needed a new tires for my wifes car They had...
2       Jim Woltman who works at Goleta Honda is 5 sta...
3       Been here a few times to get some shrimp  They...
4       This is one fantastic place to eat whether you...
                              ...                        
4995    I am not sure what to think of this place I bo...
4996    Im so excited to see the red Robin had reopene...
4997    This is our goto pizza place  We love their cr...
4998    This is located in a great spot fairly close t...
4999    I went in for a sirloin burger and a salad The...
Name: text_punct_reml, Length: 5000, dtype: object

Lowercasing the text¶

In [16]:
processedReviewData['text_lower']= processedReviewData['text_punct_reml'].apply(lambda x: x.lower())

Language checking¶

Using the langdetect library to detect reviews written in languages other than English so those rows can be removed. langdetect compares the character n-gram profile of a text against per-language profiles and reports the most probable language.

In [18]:
pip install langdetect
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 kB 26.7 MB/s eta 0:00:0000:01
  Preparing metadata (setup.py) ... done
Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from langdetect) (1.16.0)
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993224 sha256=53b09ceee945e52ed79afc8d745d19bfb2b9b761188cf15df169afd7dc67090d
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
Note: you may need to restart the kernel to use updated packages.
In [19]:
import langdetect

languages_langdetect = []

# try/except block because language detection can fail on some reviews (e.g. very short or emoji-only text)
for line in processedReviewData['text_lower']:
    try:
        result = langdetect.detect_langs(line)
        result = str(result[0])[:2]
    except:
        result = 'unknown'
    
    finally:
        languages_langdetect.append(result)

processedReviewData['languages']=languages_langdetect

processedReviewData['languages'].unique()
Out[19]:
array(['en', 'es'], dtype=object)
In [20]:
for l in processedReviewData['languages'].unique():
    if l != 'en':
        print(processedReviewData[(processedReviewData['languages']==l)].text)
3543    El po boy estaba bueno. No probé más nada pero...
Name: text, dtype: object

Dropping the rows whose detected language is not English.

In [21]:
for l in processedReviewData['languages'].unique():
        if l != 'en':
                processedReviewData.drop(processedReviewData[(processedReviewData['languages']==l)].index, axis=0, inplace=True)

processedReviewData.reset_index(inplace=True)

Stop word removal¶

Stopwords are common English words which do not add much meaning to a sentence; they can safely be removed without sacrificing the meaning of the sentence.
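As a quick self-contained illustration (using scikit-learn's built-in English stop-word list here, rather than the NLTK/Gensim lists the notebook uses below):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = "the burger was good and the service was fast"
kept = [w for w in sentence.split() if w not in ENGLISH_STOP_WORDS]
print(kept)
```

Function words like "the" and "and" drop out, while content words like "burger" survive the filter.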

In [22]:
# Cleaning the texts
import nltk
import re
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import STOPWORDS

def cleaningText(text):
    # punctuation was already stripped above, so each review usually tokenizes to a single sentence;
    # accumulate the kept words across sentences instead of overwriting them each iteration
    nltk_stops = set(stopwords.words('english'))
    words = []
    for sentence in nltk.sent_tokenize(text):
        review = re.sub('[^a-zA-Z]', ' ', sentence).split()
        review = [word for word in review if word.lower() not in nltk_stops]
        review = [word for word in review if word.lower() not in STOPWORDS]
        words.extend(review)
    return ' '.join(words)

processedReviewData['cleanText'] = processedReviewData['text_lower'].apply(lambda x : cleaningText(x))
processedReviewData['cleanText']
Out[22]:
0       went lunch burger meh obvious focus burgers di...
1       needed new tires wifes car special order day d...
2       jim woltman works goleta honda stars knowledge...
3       times shrimp theyve got nice selection differe...
4       fantastic place eat hungry need good snack goo...
                              ...                        
4994    sure think place bought groupon year ago arriv...
4995    im excited red robin reopened closer tucson ma...
4996    goto pizza place love crust toppings perfect d...
4997    located great spot fairly close downtown beach...
4998    went sirloin burger salad sirloin burgers got ...
Name: cleanText, Length: 4999, dtype: object

Lemmatization, Tokenization and removal of neutral words¶

Lemmatization : technique used to reduce inflected words to their root word; it is the algorithmic process of identifying an inflected word’s “lemma” (dictionary form) based on its intended meaning. \ Tokenization : splitting a text into small units called tokens. \ Neutral words : parts of speech such as nouns, verbs and auxiliaries which do not contribute to the sentiment of the text.

In [23]:
import spacy
from spacy.lang.en import stop_words as spacy_stopwords
stop_words = spacy_stopwords.STOP_WORDS
nlp = spacy.load('en_core_web_lg')
extraStopwords = ['ve', 'll', 'm', 's', 'd', 'ny', 'st', 'woo', 'n', 'ish']
neutralTags = ['NN', 'NNP', 'NNS', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'PRP', 'WP', 'RB', 'RBR', 'RBS', 'IN', 'DT', 'CC']
initialTags = ['ADV', 'NOUN', 'VERB', 'PROPN', 'PRON', 'AUX', 'CCONJ', 'PART', 'SYM', 'SPACE', 'PUNCT', 'DET', 'CONJ', 'X']


# lemmatization
processedReviewData['text_lemmatized']=processedReviewData['cleanText'].apply(lambda x:[token.lemma_ for token in nlp(x) if token.pos_ not in initialTags])

# rechecking for the stopwords
processedReviewData['text_lemmatized'] = processedReviewData['text_lemmatized'].apply(lambda p:[x for x in p if str(x.lower()) not in set(STOPWORDS) and str(x.lower()) not in stop_words and str(x.lower()) not in extraStopwords])

# rechecking the neutral words
processedReviewData['text_lemmatized'] = processedReviewData['text_lemmatized'].apply(lambda t: [token for token in t if nltk.pos_tag([token])[0][1] not in neutralTags])


processedReviewData['text_lemmatized']
Out[23]:
0                                    [obvious, different]
1                                   [new, special, ready]
2                  [knowledgeable, personable, fantastic]
3                                [nice, different, great]
4                           [fantastic, good, good, good]
                              ...                        
4994    [brazilian, bad, ineffective, complete, ineffe...
4995    [red, busy, typical, good, open, great, able, ...
4996                              [ultimate, busy, extra]
4997    [great, walkable, accessible, nice, expensive,...
4998                           [small, live, busy, grand]
Name: text_lemmatized, Length: 4999, dtype: object

For LDA

In [24]:
# lemmatization
initialTagsLDA = ['ADV', 'PRON', 'AUX', 'CCONJ', 'PART', 'SYM', 'SPACE', 'PUNCT', 'DET', 'CONJ', 'X', 'ADJ']
processedReviewData['token_lda']=processedReviewData['cleanText'].apply(lambda x:[token.lemma_ for token in nlp(x) if token.pos_ not in initialTagsLDA])

# rechecking for the stopwords
processedReviewData['token_lda'] = processedReviewData['token_lda'].apply(lambda p:[x for x in p if str(x.lower()) not in set(STOPWORDS) and str(x.lower()) not in stop_words and str(x.lower()) not in extraStopwords])


# rechecking the neutral words
# neutralTags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'PRP', 'WP', 'RB', 'RBR', 'RBS', 'IN', 'DT', 'CC']
# processedReviewData['text_lemmatized'] = processedReviewData['text_lemmatized'].apply(lambda t: [token for token in t if nltk.pos_tag([token])[0][1] not in neutralTags])

processedReviewData['token_lda']
Out[24]:
0       [lunch, burger, meh, focus, burger, crap, pile...
1       [need, tire, wife, car, order, day, drop, morn...
2       [jim, woltman, work, goleta, honda, star, job,...
3       [time, shrimp, selection, fish, price, seafood...
4       [place, eat, need, snack, price, staff, place,...
                              ...                        
4994    [think, place, buy, groupon, year, ago, arriva...
4995    [robin, reopen, tucson, mallthis, place, open,...
4996    [pizza, place, love, crust, topping, delivery,...
4997    [locate, spot, downtown, beach, door, wear, sn...
4998    [sirloin, burger, salad, sirloin, burger, chic...
Name: token_lda, Length: 4999, dtype: object

Checking out the words and their frequency

In [25]:
from collections import Counter

def get_all_lemmas(data):
    all_lemmas = []
    for tokens in data:
        all_lemmas.extend(tokens)
    return all_lemmas

all_lemmas = get_all_lemmas(processedReviewData.text_lemmatized)

# Count
lemmas_freq = Counter(all_lemmas)
common_lemmas = lemmas_freq.most_common(100)
print (common_lemmas, len(common_lemmas))
[('good', 2759), ('great', 2101), ('nice', 783), ('little', 602), ('bad', 490), ('new', 481), ('fresh', 445), ('small', 441), ('happy', 357), ('different', 342), ('hot', 330), ('big', 312), ('delicious', 268), ('large', 260), ('old', 259), ('busy', 226), ('special', 215), ('high', 207), ('huge', 195), ('free', 191), ('fantastic', 189), ('local', 189), ('open', 188), ('able', 183), ('extra', 167), ('attentive', 140), ('overall', 140), ('wrong', 139), ('second', 133), ('easy', 125), ('disappointed', 124), ('ready', 122), ('reasonable', 121), ('available', 120), ('short', 120), ('horrible', 116), ('entire', 114), ('terrible', 113), ('professional', 110), ('real', 109), ('hard', 108), ('comfortable', 98), ('regular', 97), ('french', 96), ('low', 91), ('expensive', 89), ('main', 82), ('red', 80), ('authentic', 80), ('italian', 79), ('black', 78), ('live', 77), ('poor', 74), ('knowledgeable', 72), ('outstanding', 72), ('white', 70), ('incredible', 70), ('green', 66), ('average', 65), ('solid', 62), ('chinese', 62), ('soft', 62), ('tiny', 60), ('healthy', 60), ('young', 59), ('true', 58), ('usual', 57), ('single', 55), ('complete', 51), ('personal', 51), ('basic', 50), ('quiet', 49), ('normal', 49), ('casual', 49), ('exceptional', 49), ('safe', 48), ('typical', 48), ('possible', 47), ('fabulous', 47), ('difficult', 47), ('generous', 47), ('satisfied', 47), ('similar', 45), ('traditional', 45), ('original', 44), ('major', 44), ('courteous', 44), ('impressed', 44), ('recent', 43), ('total', 43), ('strong', 43), ('additional', 40), ('vegetarian', 40), ('negative', 40), ('affordable', 40), ('willing', 38), ('classic', 38), ('clear', 37), ('tough', 36), ('positive', 36)] 100

Word cloud after cleaning the text¶

In [26]:
plt.figure(figsize=(20,10))
#Creating the text variable
textWordCloudAfter = " ".join(cat for cat in processedReviewData['text_lemmatized'].apply(lambda review: ' '.join(review)))
# Creating word_cloud with text as argument in .generate() method
word_cloud = WordCloud(collocations = False, background_color = 'white', width=2000, height=1000).generate(textWordCloudAfter)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Creating bi-grams and trigrams¶

A Bigram takes a sentence and gives us sets of two consecutive words in the sentence; a Trigram gives sets of three consecutive words. A phrase often carries more meaning than a single word: a two-word phrase is a bigram and a three-word phrase is a trigram.
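As a quick illustration on a made-up token list, `nltk.bigrams` and `nltk.trigrams` slide a window of two or three tokens over the sequence:

```python
import nltk

tokens = ["great", "fresh", "local", "food"]
print([' '.join(t) for t in nltk.bigrams(tokens)])   # ['great fresh', 'fresh local', 'local food']
print([' '.join(t) for t in nltk.trigrams(tokens)])  # ['great fresh local', 'fresh local food']
```

A list of k tokens yields k-1 bigrams and k-2 trigrams, which is why the functions below guard against shorter token lists.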

In [27]:
processedReviewData['text_lemmatized_ngram']=processedReviewData['cleanText'].apply(lambda x: [token.lemma_ for token in nlp(x) if token.pos_ not in ["NOUN", "PRON", 'PROPN', 'X']])

# rechecking for the stopwords
processedReviewData['text_lemmatized_ngram'] = processedReviewData['text_lemmatized_ngram'].apply(lambda p:[x for x in p if str(x.lower()) not in set(STOPWORDS) and str(x.lower()) not in stop_words and str(x.lower()) not in extraStopwords])

# rechecking the neutral words
processedReviewData['text_lemmatized_ngram'] = processedReviewData['text_lemmatized_ngram'].apply(lambda t: [token for token in t if nltk.pos_tag([token])[0][1] not in ['NN', 'NNP', 'NNS', 'NNPS', 'PRP', 'WP']])


processedReviewData['text_lemmatized_ngram']
Out[27]:
0       [obvious, different, appear, preformed, contrary]
1                            [new, special, later, ready]
2                  [knowledgeable, personable, fantastic]
3                                [nice, different, great]
4                 [fantastic, good, good, good, friendly]
                              ...                        
4994    [buy, ago, reluctantly, brazilian, bad, comple...
4995    [excited, red, reopen, close, busy, open, typi...
4996      [ultimate, friendly, busy, occasionally, extra]
4997    [great, fairly, close, walkable, accessible, n...
4998    [come, small, live, away, let, leave, busy, cl...
Name: text_lemmatized_ngram, Length: 4999, dtype: object
In [29]:
# Assuming 'text_lemmatized_ngram' column contains tokenized text

# Function to generate bigrams
def generate_bigrams(token):
    if len(token) >= 2:
        return [' '.join(t) for t in list(nltk.bigrams(token))]
    else:
        return []

# Function to generate trigrams
def generate_trigrams(token):
    if len(token) >= 3:
        return [' '.join(t) for t in list(nltk.trigrams(token))]
    else:
        return []

# Apply functions to create bigrams and trigrams
processedReviewData['text_bigrams'] = processedReviewData['text_lemmatized_ngram'].apply(generate_bigrams)
processedReviewData['text_trigrams'] = processedReviewData['text_lemmatized_ngram'].apply(generate_trigrams)
In [30]:
def get_ngrams(data, common):
    lemma_ngram = []
    for tokens in data:
        lemma_ngram.extend(tokens)
    # Count
    lemmas_freq_ngram = Counter(lemma_ngram)
    return lemmas_freq_ngram.most_common(common)
   

print("Bigrams -- \n", get_ngrams(processedReviewData['text_bigrams'], 100))
print("Trigrams -- \n", get_ngrams(processedReviewData['text_trigrams'], 50))
Bigrams -- 
 [('good good', 243), ('great great', 185), ('good great', 143), ('pretty good', 128), ('good like', 120), ('great good', 113), ('like like', 99), ('good come', 86), ('like good', 74), ('great friendly', 71), ('come come', 69), ('come good', 66), ('great like', 62), ('good friendly', 56), ('definitely come', 55), ('friendly great', 54), ('good definitely', 52), ('come like', 52), ('let know', 51), ('like come', 49), ('great definitely', 49), ('great nice', 48), ('amazing great', 48), ('friendly good', 47), ('good little', 47), ('good amazing', 46), ('like great', 46), ('know good', 46), ('great come', 46), ('probably good', 43), ('good nice', 42), ('great little', 41), ('great amazing', 41), ('nice great', 38), ('good long', 38), ('nice good', 37), ('come great', 36), ('great fresh', 35), ('good small', 34), ('good pretty', 34), ('like know', 33), ('amazing good', 32), ('good fresh', 31), ('great happy', 30), ('know great', 29), ('good leave', 28), ('nice like', 28), ('nice nice', 28), ('fresh good', 28), ('come know', 27), ('fresh great', 27), ('good know', 26), ('bad come', 26), ('know like', 26), ('overall good', 26), ('know know', 26), ('hot good', 26), ('different good', 25), ('like little', 25), ('good bad', 25), ('good highly', 24), ('new new', 24), ('little like', 24), ('like nice', 23), ('like pretty', 23), ('like friendly', 23), ('friendly attentive', 23), ('overall great', 23), ('small good', 22), ('come pretty', 22), ('nice come', 22), ('expect good', 22), ('absolutely delicious', 22), ('know come', 22), ('good large', 22), ('great pretty', 22), ('definitely good', 22), ('come hot', 22), ('nice friendly', 22), ('far good', 22), ('come nice', 22), ('hot hot', 21), ('little great', 21), ('great small', 21), ('good overall', 21), ('long come', 21), ('bad good', 21), ('little good', 20), ('great know', 20), ('good new', 20), ('friendly come', 20), ('long good', 20), ('good different', 19), ('finally come', 19), ('like long', 19), ('great old', 19), ('leave 
like', 19), ('come fresh', 19), ('special good', 18), ('old like', 18)]
Trigrams -- 
 [('great great great', 32), ('good good good', 26), ('good good great', 15), ('like pretty good', 11), ('pretty good good', 11), ('great great good', 11), ('pretty good like', 10), ('good great great', 10), ('pretty good great', 9), ('good good friendly', 8), ('great good good', 8), ('great friendly great', 8), ('good nice good', 7), ('good great good', 7), ('good good like', 7), ('good good definitely', 7), ('actually pretty good', 7), ('like great great', 7), ('great friendly good', 6), ('good good long', 6), ('know great great', 6), ('great good friendly', 6), ('amazing great definitely', 6), ('amazing good great', 6), ('like good good', 6), ('good good come', 6), ('great pretty good', 5), ('good like like', 5), ('leave like come', 5), ('know good good', 5), ('great like good', 5), ('good come good', 5), ('good know good', 5), ('great good great', 5), ('like good like', 5), ('good like pretty', 5), ('good pretty good', 5), ('come good good', 5), ('good like good', 5), ('good amazing good', 5), ('pretty good amazing', 5), ('good good little', 5), ('good friendly good', 5), ('like like like', 5), ('like come like', 5), ('pretty good little', 5), ('good great little', 4), ('come like come', 4), ('good friendly great', 4), ('great fresh great', 4)]
In [31]:
# import and instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

# vectorize the lemmatized text
bagWords = cv.fit_transform(processedReviewData['text_lemmatized'].astype(str))
bagWords.shape
Out[31]:
(4999, 877)

TF-IDF¶

  • Tf stands for term frequency, the number of times the word appears in each document.

  • Idf stands for inverse document frequency, an inverse count of the number of documents a word appears in. Idf measures how significant a word is in the whole corpus.

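With scikit-learn's defaults (`smooth_idf=True`), the weight is tf-idf(t, d) = tf(t, d) × (ln((1 + n) / (1 + df(t))) + 1), with each document row then L2-normalized; a toy check on a hypothetical three-document corpus:

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good food", "bad food", "good service"]
vec = TfidfVectorizer()
vec.fit(docs)

# 'food' appears in 2 of the 3 documents, so df = 2
n, df = 3, 2
idf_food = math.log((1 + n) / (1 + df)) + 1
print(round(idf_food, 4))                           # 1.2877
print(round(vec.idf_[vec.vocabulary_['food']], 4))  # 1.2877
```

A term present in every document gets the minimum idf of 1, while rarer terms get progressively larger weights.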

In [32]:
# import and instantiate the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

# apply the vectorizer to the corpus
idfVector = vectorizer.fit_transform(processedReviewData['text_lemmatized'].astype(str))

# display the document-term matrix
vocab = vectorizer.get_feature_names_out()
print(idfVector.shape)
vocab
(4999, 877)
Out[32]:
array(['able', 'academic', 'acceptable', 'accessable', 'accessible',
       'accountable', 'acknowledgable', 'acoustic', 'active', 'actual',
       'addictive', 'additional', 'adjustable', 'adorable', 'advanced',
       'adventuous', 'adventuredrive', 'adventurous', 'aerial',
       'aesthetic', 'affected', 'affordable', 'aggressive',
       'agricultural', 'alcoholic', 'alive', 'alonecasual',
       'alwaysavailable', 'amateurish', 'ambiguous', 'ambitious',
       'american', 'americanitalian', 'amicable', 'amish', 'anal',
       'angry', 'annual', 'anxious', 'apathetic', 'apocalyptic',
       'apologetic', 'appalled', 'applicable', 'appreciative',
       'apprehensive', 'approachable', 'appy', 'aquatic', 'architectural',
       'argentinian', 'armored', 'arrive', 'arrogant', 'artful',
       'artificial', 'artistic', 'asian', 'asthmatic', 'astronomical',
       'athenian', 'athletic', 'atrocious', 'attentive', 'attractive',
       'audible', 'australian', 'authentic', 'autistic', 'automatic',
       'auxiliary', 'available', 'average', 'averagetypical', 'avian',
       'aware', 'bad', 'barnesnoble', 'basic', 'belgian', 'best', 'big',
       'bilingual', 'billion', 'biodegradable', 'black', 'blasphemous',
       'boisterous', 'bombulicious', 'brazilian', 'british', 'broad',
       'bureaucratic', 'busy', 'capable', 'casual', 'catastrophic',
       'cathedral', 'cautious', 'cdelicious', 'central', 'ceramic',
       'certain', 'chaotic', 'charismatic', 'charitable', 'chic',
       'chinese', 'chiropractic', 'chronological', 'circuitous',
       'citizenmoral', 'civil', 'classic', 'classical', 'clear',
       'clinical', 'cobble', 'colombian', 'comfortable', 'comic',
       'comical', 'commercial', 'common', 'comparable', 'competitive',
       'complete', 'complex', 'composite', 'concerned', 'configurable',
       'confortable', 'conscious', 'consecutive', 'conservative',
       'considerable', 'conspicuous', 'constant', 'contagious',
       'contemporary', 'continued', 'contrarian', 'contrary',
       'conventional', 'copious', 'corporate', 'cosmetic', 'costly',
       'courteous', 'cous', 'cozyromanticrustic', 'crappy', 'creative',
       'criminal', 'critical', 'cultural', 'curious', 'current',
       'customary', 'customizable', 'cylindrical', 'daily', 'dangerous',
       'dead', 'decipherable', 'deductible', 'deeeeeeelicious',
       'defective', 'definitive', 'delectable', 'delicious', 'delighted',
       'demographic', 'dependable', 'deplorable', 'desirable', 'detailed',
       'diabetic', 'diagnostic', 'dian', 'dietary', 'different',
       'difficult', 'direct', 'disabled', 'disappointed', 'disposable',
       'dissatisfied', 'distinguishable', 'diuretic', 'doable',
       'domestic', 'dooable', 'draconian', 'dramatic', 'drinkable',
       'drinkingable', 'drinksmiscellaneous', 'drippy', 'drivable',
       'dynamic', 'earthconscious', 'eastern', 'easy', 'eatable',
       'eccentric', 'eclectic', 'economical', 'ecstatic', 'ecuadorian',
       'edible', 'educational', 'effective', 'egregious', 'electric',
       'electrical', 'electronic', 'elementary', 'elusive', 'emblematic',
       'emotional', 'empathetic', 'energetic', 'english', 'enjoyable',
       'enjoyedive', 'enormous', 'entensive', 'enthusiastic', 'entire',
       'environmental', 'equal', 'erratic', 'especial', 'essential',
       'eternal', 'ethical', 'ethiopian', 'ethnic', 'ethnicnational',
       'ethopian', 'european', 'everpretentious', 'exceptional',
       'excessive', 'exclusive', 'exemplary', 'exhaustive', 'existential',
       'exotic', 'expansive', 'expensive', 'experienced', 'extensive',
       'extra', 'extraordinary', 'fabulous', 'facial', 'factual', 'false',
       'familiar', 'famous', 'fanatic', 'fantastic', 'fashionable',
       'fastcasual', 'favorable', 'federal', 'festive', 'final',
       'financial', 'fixable', 'flat', 'flexible', 'floppy', 'floral',
       'focal', 'fondue', 'foodive', 'foolish', 'foreign', 'foreseeable',
       'forgettable', 'formal', 'formidable', 'fourth', 'fractional',
       'free', 'french', 'fresh', 'functional', 'furious', 'geneous',
       'general', 'generous', 'german', 'gigantic', 'ginormous',
       'glamorous', 'global', 'gloppy', 'glorious', 'gneeral', 'golden',
       'good', 'gorgeous', 'gracious', 'grand', 'graphic', 'great',
       'green', 'gross', 'guilty', 'happy', 'hard', 'hawaiian', 'healthy',
       'heavy', 'hectic', 'heretypical', 'hermetic', 'hideous', 'high',
       'hilarious', 'hippy', 'hispanic', 'historical', 'honorable',
       'hoppy', 'horrendous', 'horrible', 'horticultural', 'hospitable',
       'hot', 'huge', 'humble', 'humorous', 'hypersexual', 'hypothetical',
       'iced', 'identical', 'illegal', 'imaginative', 'immersive',
       'impeccable', 'imperial', 'important', 'impossible', 'impractical',
       'impressed', 'impressive', 'inattentive', 'incapable', 'inclined',
       'inclusive', 'incredible', 'indecisive', 'independent',
       'indescribable', 'indian', 'indicative', 'individual',
       'industrial', 'inedible', 'ineffective', 'inevitable',
       'inexcusable', 'inexpensive', 'inexperienced', 'inexplicable',
       'infamous', 'infectious', 'inflexible', 'influential', 'informal',
       'informational', 'informative', 'ingenious', 'initial',
       'innocuous', 'innovative', 'inoperable', 'institutional',
       'instructional', 'instrumental', 'insultingive', 'insurmountable',
       'intact', 'intensive', 'intentional', 'interactive', 'interested',
       'internal', 'international', 'intrusive', 'iralian', 'irish',
       'irreplaceable', 'irritable', 'irritated', 'isolated', 'israeli',
       'italian', 'japanese', 'jealous', 'knowledgable', 'knowledgeable',
       'lackadaisical', 'large', 'laughable', 'lavish', 'lebanese',
       'legal', 'legendary', 'legislative', 'lest', 'liberal', 'likely',
       'limited', 'literal', 'little', 'live', 'livemusic',
       'livingsocial', 'local', 'longish', 'loose', 'low', 'lucky',
       'lunchcomplimentary', 'lunchvegetarian', 'luscious', 'luxurious',
       'magical', 'main', 'majestic', 'major', 'manageable', 'manual',
       'manyunbelievable', 'married', 'marvelous', 'masochistic',
       'massive', 'mechanical', 'medical', 'memorable', 'meticulous',
       'mexican', 'microwaveable', 'military', 'million', 'minimalistic',
       'miraculous', 'miserable', 'mixed', 'modern', 'modest',
       'monotonous', 'monstrous', 'moral', 'moroccan', 'municipal',
       'musical', 'mysterious', 'naked', 'nasty', 'national',
       'nationwide', 'native', 'natural', 'naturalistic', 'nearest',
       'necessary', 'negative', 'neglectful', 'nervous', 'neurological',
       'neutral', 'new', 'newagepostmodern', 'nice', 'nitrous', 'noble',
       'nomadfive', 'nominal', 'nonalcoholic', 'nonfunctional',
       'nonintrusive', 'noninvasive', 'nonrefundable', 'nonreturnable',
       'nonvegetarian', 'nonverbal', 'normal', 'northern', 'norwegian',
       'notable', 'notary', 'noteworthy', 'noticeable', 'notorious',
       'noxious', 'nuclear', 'numerous', 'nutritional', 'oblivious',
       'obnoxious', 'obvious', 'occasional', 'offensive', 'ohsocrucial',
       'oilbalasmic', 'old', 'olive', 'open', 'opentable', 'operational',
       'optimistic', 'optional', 'oral', 'ordinary', 'organic',
       'orgasmic', 'oriental', 'original', 'orleanian', 'ostentatious',
       'outrageous', 'outstanding', 'overall', 'overdue', 'overfive',
       'overwhelmed', 'palatable', 'parisian', 'partial', 'particular',
       'passable', 'passible', 'pathetic', 'personable', 'personal',
       'peruvian', 'phantasmagorical', 'physical', 'pittsburghian',
       'pleased', 'pleasurable', 'polish', 'political', 'poor', 'poppy',
       'popular', 'portable', 'portuguese', 'positive', 'possible',
       'potatoarugulacheese', 'potential', 'powerful', 'practical',
       'precious', 'predictable', 'preliminary', 'prepared',
       'pretentious', 'previous', 'private', 'problematic',
       'professional', 'prophetic', 'prosthetic', 'provencal',
       'puddingfantastic', 'punctual', 'questionable', 'quiet',
       'quintessential', 'racial', 'raucous', 'ready', 'real',
       'realistic', 'reasonable', 'recent', 'recyclable', 'red',
       'refundable', 'regional', 'regrettable', 'regular', 'relatable',
       'related', 'reliable', 'religious', 'remarkable', 'reputable',
       'residential', 'residual', 'respectable', 'responsible', 'retail',
       'reusable', 'revolutionary', 'rican', 'rich', 'rid', 'ridiculous',
       'righteous', 'romantic', 'rural', 'russian', 'rustic', 'safe',
       'salvadorian', 'sanitary', 'sarcastic', 'satisfied',
       'saturdayincredible', 'saucecheese', 'scan', 'scandinavian',
       'scary', 'scientific', 'scottish', 'scrumptious', 'seasonal',
       'seatingcomplimentary', 'second', 'sectional', 'seinfeldian',
       'semiauthentic', 'semiinconspicuous', 'senior', 'sensitive',
       'separate', 'serviceable', 'seven', 'severe', 'sexual', 'sharable',
       'shareable', 'sharp', 'short', 'shrimpcheckbavarian', 'siamese',
       'sichuanese', 'sicilian', 'significant', 'similar', 'simplistic',
       'single', 'sizable', 'skeptical', 'sloppy', 'small', 'snappy',
       'snippy', 'socal', 'sociable', 'social', 'soft', 'solid',
       'sophisticated', 'southern', 'spacious', 'spanish', 'spastic',
       'special', 'specialized', 'specific', 'spiritual', 'spontaneous',
       'spreadable', 'stable', 'starsmusic', 'stationary', 'steady',
       'strategic', 'strenuous', 'strong', 'stupendous', 'stupid',
       'substantial', 'successful', 'sudden', 'sugary', 'suitable',
       'sumptuous', 'superior', 'supernatural', 'surgical', 'surprised',
       'suspicious', 'sustainable', 'swedish', 'swiss', 'symmetrical',
       'sympathetic', 'synonymous', 'synthetic', 'taiwanese', 'tastic',
       'tawainese', 'technical', 'temporary', 'tenuous', 'terrible',
       'textural', 'thailaotian', 'therapeutic', 'thetable', 'thoughtful',
       'tiny', 'tolerable', 'topadrian', 'total', 'touchable', 'tough',
       'tradional', 'traditional', 'tragicomic', 'transamerican',
       'transformational', 'traumatic', 'treacherous', 'tremendous',
       'tropical', 'troubled', 'true', 'tuscan', 'typical', 'ukrainian',
       'ultimate', 'ultra', 'unable', 'unacceptable', 'unannounced',
       'unapologetic', 'unapproachable', 'unattended', 'unattentive',
       'unattractive', 'unavailable', 'unbearable', 'unbeatable',
       'unbelievable', 'unbiased', 'unborn', 'uncanny', 'unclean',
       'unclear', 'unclogged', 'uncomfortable', 'uncommon', 'unconcerned',
       'unconventional', 'uncorked', 'undeniable', 'undercooked',
       'underdressed', 'underrated', 'undersalted', 'underseasoned',
       'understandable', 'undertrained', 'underwhelmed', 'undivided',
       'uneasy', 'uneaten', 'uneducated', 'unenthusiastic', 'unethical',
       'uneventful', 'unexpected', 'unflavored', 'unfounded', 'unglued',
       'ungrateful', 'unhappy', 'unhealthy', 'unheard', 'unhelpful',
       'uni', 'unicorn', 'unidentifiable', 'unidentified', 'unimpressed',
       'uninformed', 'uninterested', 'unknown', 'unlikely', 'unlimited',
       'unlucky', 'unmanned', 'unmarked', 'unmelted', 'unmemorable',
       'unmitigated', 'unnecessary', 'unnoticed', 'unobtrusive',
       'unorganized', 'unpacked', 'unpalatable', 'unpaved', 'unpleasant',
       'unpredictable', 'unprepared', 'unprofessional', 'unproffessional',
       'unreasonable', 'unremarkable', 'unresponsive', 'unrivaled',
       'unsafe', 'unsanitary', 'unsatisfied', 'unseasoned', 'unseen',
       'unserved', 'unsimilar', 'unspectacular', 'unsuccessful',
       'unsweetened', 'untoasted', 'untouched', 'untrained', 'untreated',
       'untrue', 'unusable', 'unusual', 'unwanted', 'unwarranted',
       'unwilling', 'unwrapped', 'uplandwant', 'upper', 'upright',
       'upsale', 'upscale', 'uptown', 'urban', 'useable', 'useful',
       'usual', 'valid', 'valuable', 'vegetarian', 'venetian',
       'veterinarian', 'viable', 'victorian', 'vietnamese', 'viewable',
       'vigorous', 'virtual', 'visible', 'walkable', 'wary',
       'wastedtypical', 'weak', 'weary', 'weekly', 'western', 'whimsical',
       'white', 'whogotwhat', 'wide', 'widespread', 'willing', 'wondrous',
       'workpersonal', 'worried', 'wrong', 'young'], dtype=object)

Word Embeddings (FastText)¶

In [33]:
from gensim.models import FastText

model_ted = FastText(processedReviewData['text_lemmatized'], vector_size=500, window=3, min_count=3, workers=4, sg=1)

wordFastText = pd.concat([pd.DataFrame(model_ted.wv.index_to_key, columns=['words']), pd.DataFrame(model_ted.wv.vectors)], axis=1)

wordFastText
Out[33]:
words 0 1 2 3 4 5 6 7 8 ... 490 491 492 493 494 495 496 497 498 499
0 good 0.007585 0.020563 0.031869 -0.023586 -0.092026 0.118824 -0.010433 0.071385 0.048883 ... 0.015999 -0.053557 0.086451 0.079166 0.017622 -0.039000 -0.011855 0.017838 0.021071 -0.063291
1 great 0.006736 0.021745 0.031971 -0.023755 -0.092718 0.120262 -0.011136 0.071410 0.048961 ... 0.016211 -0.053955 0.087280 0.080063 0.017959 -0.039566 -0.012185 0.017989 0.021260 -0.063534
2 nice 0.006962 0.020661 0.031053 -0.022866 -0.092167 0.118266 -0.011579 0.070583 0.048773 ... 0.014773 -0.052688 0.086204 0.078155 0.017732 -0.039316 -0.012118 0.017011 0.020682 -0.061572
3 little 0.006871 0.021982 0.031937 -0.023956 -0.094171 0.121453 -0.011230 0.072278 0.049842 ... 0.016465 -0.054544 0.088703 0.081260 0.018051 -0.040710 -0.012799 0.018143 0.021881 -0.064991
4 bad 0.006066 0.020306 0.030078 -0.023124 -0.087307 0.114183 -0.011509 0.068787 0.047072 ... 0.015191 -0.050275 0.082482 0.074933 0.017196 -0.038018 -0.011489 0.015674 0.019779 -0.060988
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
407 predictable 0.005787 0.019836 0.029363 -0.022185 -0.085495 0.111149 -0.010124 0.066129 0.045813 ... 0.015303 -0.049326 0.080736 0.073998 0.016647 -0.036347 -0.011511 0.016032 0.019480 -0.058990
408 asthmatic 0.005678 0.017467 0.026198 -0.019846 -0.075831 0.098743 -0.009328 0.058476 0.040706 ... 0.013497 -0.043997 0.071335 0.065167 0.014708 -0.032377 -0.010231 0.014353 0.017530 -0.051897
409 inevitable 0.007030 0.022028 0.032665 -0.024929 -0.095467 0.124766 -0.011445 0.074270 0.051407 ... 0.017394 -0.055469 0.090648 0.083029 0.018738 -0.040776 -0.012426 0.018116 0.021997 -0.065973
410 cautious 0.006735 0.021391 0.031001 -0.023805 -0.092015 0.119459 -0.010759 0.070910 0.049286 ... 0.016081 -0.053000 0.086676 0.079199 0.017659 -0.038975 -0.012009 0.017500 0.021157 -0.063003
411 inclusive 0.005418 0.017761 0.027105 -0.020305 -0.077979 0.101299 -0.009176 0.060560 0.041862 ... 0.013913 -0.044960 0.073105 0.066813 0.015029 -0.032807 -0.010161 0.014970 0.017935 -0.053572

412 rows × 501 columns

Bigram and trigram vector data¶

In [34]:
from gensim.models.phrases import Phrases, Phraser, ENGLISH_CONNECTOR_WORDS

def bigram2vec(unigrams):
    bigram = Phraser(Phrases(unigrams, min_count=3, connector_words=ENGLISH_CONNECTOR_WORDS))
    trigram = Phraser(Phrases(bigram[unigrams], min_count=1, connector_words=ENGLISH_CONNECTOR_WORDS))
    return FastText(trigram[bigram[unigrams]], min_count=3,vector_size=500)

resBigram = bigram2vec(processedReviewData['text_lemmatized'])
FastTextGram = pd.concat([pd.DataFrame(resBigram.wv.index_to_key, columns=['words']), pd.DataFrame(resBigram.wv.vectors)], axis=1)
FastTextGram
Out[34]:
words 0 1 2 3 4 5 6 7 8 ... 490 491 492 493 494 495 496 497 498 499
0 good -0.004363 0.059130 -0.033575 -0.015365 -0.055984 0.229594 0.101116 0.099879 0.124248 ... 0.035882 -0.106193 0.142355 0.153548 0.023196 -0.068305 -0.101732 -0.012262 0.025996 -0.063845
1 great -0.005083 0.055807 -0.031517 -0.014320 -0.051816 0.213626 0.093615 0.092167 0.115060 ... 0.033472 -0.099019 0.132674 0.143456 0.021747 -0.063909 -0.095276 -0.011507 0.024223 -0.059054
2 nice -0.004621 0.057925 -0.033440 -0.014672 -0.055553 0.224275 0.097540 0.097148 0.121094 ... 0.033674 -0.103686 0.139565 0.149776 0.022943 -0.067398 -0.099848 -0.012316 0.025370 -0.060970
3 little -0.005094 0.060075 -0.034196 -0.015304 -0.056058 0.229196 0.100533 0.098904 0.123695 ... 0.035680 -0.106335 0.142592 0.153924 0.023370 -0.069184 -0.102532 -0.012338 0.026522 -0.063816
4 bad -0.005169 0.061367 -0.034739 -0.016215 -0.056921 0.235274 0.102217 0.102554 0.127276 ... 0.036412 -0.108640 0.146176 0.156994 0.024103 -0.070742 -0.104594 -0.013407 0.026801 -0.065800
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
410 unremarkable -0.003312 0.040384 -0.022834 -0.010332 -0.037548 0.154411 0.068190 0.066807 0.083270 ... 0.024341 -0.071124 0.095956 0.103041 0.015808 -0.046174 -0.068809 -0.008217 0.017747 -0.042761
411 intrusive -0.003504 0.039752 -0.022503 -0.010556 -0.037427 0.153134 0.067137 0.066405 0.082567 ... 0.023477 -0.070610 0.095049 0.102024 0.015873 -0.045570 -0.068041 -0.008028 0.017643 -0.042493
412 laughable -0.004042 0.049433 -0.027913 -0.012833 -0.046223 0.189718 0.083576 0.082299 0.102606 ... 0.029824 -0.087221 0.117993 0.127240 0.019594 -0.056342 -0.084293 -0.010275 0.021348 -0.052545
413 social_safe -0.001579 0.021694 -0.012124 -0.005535 -0.019793 0.082633 0.036100 0.035450 0.044777 ... 0.012776 -0.038453 0.051318 0.055270 0.008523 -0.024798 -0.036753 -0.004801 0.009108 -0.023116
414 avian -0.003986 0.042184 -0.024468 -0.010726 -0.039081 0.160163 0.070256 0.069091 0.086504 ... 0.024721 -0.073872 0.099148 0.107591 0.016270 -0.047616 -0.071266 -0.009398 0.018163 -0.044432

415 rows × 501 columns
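Under the hood, `Phrases` promotes a token pair to a bigram when its co-occurrence score clears a threshold. A minimal pure-Python sketch of the default scoring rule (gensim's `original_scorer`: `(count(a,b) - min_count) * vocab_size / (count(a) * count(b))`), with made-up counts for illustration:

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Mikolov-style collocation score, as in gensim's default Phrases scorer."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# A pair that co-occurs often relative to its parts scores high...
high = phrase_score(count_a=10, count_b=12, count_ab=9, vocab_size=100, min_count=3)
# ...while a chance pairing of two frequent words scores low.
low = phrase_score(count_a=50, count_b=60, count_ab=4, vocab_size=100, min_count=3)
print(high, low)
```

Pairs whose score exceeds the `threshold` parameter are joined into a single `word_a_word_b` token, which is why tokens like `social_safe` appear in the vocabulary above.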

Sentiment analysis¶

TextBlob¶

Polarity measures the sentiment of the text. It lies in [-1, 1], where -1 denotes a highly negative sentiment and 1 a highly positive one.

Subjectivity measures whether the text is factual information or personal opinion. It lies in [0, 1], where values close to 0 indicate factual information and values close to 1 indicate personal opinion.
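The binarisation used in this notebook (`np.where(polarity > 0, 1, 0)`) collapses polarity into a 0/1 label; note that a polarity of exactly 0 (fully neutral text) counts as negative. A minimal sketch of that mapping (`label_sentiment` is an illustrative name, not part of the notebook):

```python
def label_sentiment(polarity: float) -> int:
    """1 = positive, 0 = negative/neutral, matching np.where(polarity > 0, 1, 0)."""
    return 1 if polarity > 0 else 0

print(label_sentiment(0.8), label_sentiment(-0.3), label_sentiment(0.0))
```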

In [35]:
from textblob import TextBlob
processedReviewData['text_polarity']= processedReviewData['cleanText'].apply(lambda x: TextBlob(x).sentiment.polarity)
processedReviewData['text_subjectivity']= processedReviewData['cleanText'].apply(lambda x: TextBlob(x).sentiment.subjectivity)

processedReviewData['textBlobSentiments'] = np.where(processedReviewData['text_polarity']>0, 1, 0)
In [36]:
fig = plt.figure(figsize=(20, 5), tight_layout=True)

plt.subplot(1, 2, 1)
textSent = sns.countplot(x='stars', hue='textBlobSentiments', data=processedReviewData)
for p in textSent.patches:
    txt = str(p.get_height())
    txt_x = p.get_x()
    txt_y = p.get_height()
    textSent.text(txt_x, txt_y, txt, size=14)
plt.title("Textblob sentiment analysis for review")
plt.xlabel("Review stars (ratings)")
plt.ylabel("Number of reviews")
plt.legend(["Negative", "Positive"])

plt.subplot(1, 2, 2)
sns.kdeplot(data=processedReviewData, x='text_polarity', hue='stars', palette="Set1")
plt.title("Textblob sentiment polarity distribution")
plt.xlabel("Polarity")
Out[36]:
Text(0.5, 0, 'Polarity')

According to the polarity values, 981 reviews are negative or close to negative.

VADER (Valence Aware Dictionary and sEntiment Reasoner)¶

The compound score is computed by summing the valence scores of each word in the lexicon and normalising the result to [-1, 1], where -1 is the most extreme negative and +1 the most extreme positive. The conventional cutoffs are:

- positive sentiment: compound >= 0.05
- neutral sentiment: -0.05 < compound < 0.05
- negative sentiment: compound <= -0.05
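The three-way cutoffs above can be sketched as a small helper (note the notebook itself only binarises at 0, so this finer labelling is illustrative):

```python
def vader_label(compound: float) -> str:
    """Map a VADER compound score to a label using the standard cutoffs."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(vader_label(0.62), vader_label(-0.40), vader_label(0.01))
```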

In [37]:
pip install vaderSentiment
Collecting vaderSentiment
Successfully installed vaderSentiment-3.3.2
Note: you may need to restart the kernel to use updated packages.
In [38]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sentiment = SentimentIntensityAnalyzer()

processedReviewData['vadar_polarity']= processedReviewData['cleanText'].apply(lambda x: sentiment.polarity_scores(x)['compound'])

processedReviewData['vadarSentiments'] = np.where(processedReviewData['vadar_polarity']>0, 1, 0)
In [39]:
fig = plt.figure(figsize=(20, 5), tight_layout=True)

plt.subplot(1, 2, 1)
textSent = sns.countplot(x='stars', hue='vadarSentiments', data=processedReviewData)
for p in textSent.patches:
    txt = str(p.get_height())
    txt_x = p.get_x() + 0.1
    txt_y = p.get_height()
    textSent.text(txt_x, txt_y, txt, size=14)
plt.title("VADER sentiment analysis for review")
plt.xlabel("Review stars (ratings)")
plt.ylabel("Number of reviews")
plt.legend(["Negative", "Positive"])

plt.subplot(1, 2, 2)
sns.kdeplot(data=processedReviewData, x='vadar_polarity', hue='stars', palette="Set1")
plt.title("VADER sentiment polarity distribution")
plt.xlabel("Polarity")
Out[39]:
Text(0.5, 0, 'Polarity')
Flair¶

In [41]:
pip install flair
Collecting flair
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.5.4 requires botocore<1.31.18,>=1.31.17, but you have botocore 1.29.165 which is incompatible.
Successfully installed botocore-1.29.165 bpemb-0.3.4 conllu-4.5.3 flair-0.13.0 ftfy-6.1.3 gdown-4.7.1 pptree-3.1 pytorch-revgrad-0.2.0 segtok-1.5.11 sqlitedict-2.1.0 transformer-smaller-training-vocab-0.3.3 wcwidth-0.2.12 wikipedia-api-0.6.0
Note: you may need to restart the kernel to use updated packages.
In [43]:
from flair.models import TextClassifier
from flair.data import Sentence
sia = TextClassifier.load('en-sentiment')

def flair_prediction(x):
    sentence = Sentence(x)
    sia.predict(sentence)
    return sentence.labels[0].to_dict()

resultsFlair = pd.Series(processedReviewData["cleanText"].apply(flair_prediction))

processedReviewData["flairPolarity"] = resultsFlair.apply(lambda con : con['confidence'])
processedReviewData["flairSentiment"] = resultsFlair.apply(lambda con : 1 if con['value']=="POSITIVE" else 0)


fig = plt.figure(figsize=(20, 5), tight_layout=True)

plt.subplot(1, 2, 1)
textSent = sns.countplot(x='stars', hue='flairSentiment', data=processedReviewData)
for p in textSent.patches:
    txt = str(p.get_height())
    txt_x = p.get_x() + 0.1
    txt_y = p.get_height()
    textSent.text(txt_x, txt_y, txt, size=14)
plt.title("Flair sentiment analysis for review")
plt.xlabel("Review stars (ratings)")
plt.ylabel("Number of reviews")
plt.legend(["Negative", "Positive"])

plt.subplot(1, 2, 2)
sns.kdeplot(data=processedReviewData, x='flairPolarity', hue='stars', palette="Set1")
plt.title("Flair sentiment polarity distribution")
plt.xlabel("Polarity")
Out[43]:
Text(0.5, 0, 'Polarity')

Principal component analysis¶

In [44]:
from sklearn.decomposition import PCA

# function to verify the variance acquired by the PC
def pcaFunction(data,numberOfComponent):
    pca=PCA(n_components=numberOfComponent)
    pca.fit(data)
    scree = pca.explained_variance_ratio_*100
    plt.figure(figsize=(7,5))
    plt.bar(np.arange(len(scree))+1, scree)
    plt.plot(np.arange(len(scree))+1, scree.cumsum(),c="red",marker='o')
    plt.xlabel("Number of principal components")
    plt.ylabel("Percentage explained variance")
    plt.title("Scree Plot to check variance ratio")
    plt.xticks([1, 2])
    plt.text(1.5, 25, (("Variance accumulation with\n2 components : {}%").format((np.cumsum(pca.explained_variance_ratio_)[-1]*100).round(2))), fontsize=12)
    plt.show(block=False)
    return pca
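The explained-variance ratios plotted by `pcaFunction` can be cross-checked with plain NumPy: centre the data, take the singular values, and normalise their squares. A sketch on random data (not the notebook's embeddings), where one direction is inflated so the first component dominates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] *= 10          # inflate one direction so the first PC dominates

Xc = X - X.mean(axis=0)                  # PCA requires centred data
s = np.linalg.svd(Xc, compute_uv=False)  # singular values, in descending order
ratios = s**2 / np.sum(s**2)             # explained-variance ratio per component
print(ratios)
```

The ratios always sum to 1, and their cumulative sum is what the red curve in the scree plot traces.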
In [45]:
pcaFunction(wordFastText.iloc[:,1:], 2)
Out[45]:
PCA(n_components=2)

Reducing the dimension of the FastText word embedding data with PCA

In [46]:
def pcaReduction(data):
    pca = PCA(n_components=2)
    return pd.DataFrame(pca.fit_transform(data))
pcaData = pcaReduction(wordFastText.iloc[:,1:])
pcaData['words'] = model_ted.wv.index_to_key
pcaData
Out[46]:
0 1 words
0 -0.031767 -0.004287 good
1 -0.045535 -0.004297 great
2 -0.017838 0.000361 nice
3 -0.061231 -0.000994 little
4 0.037099 0.005710 bad
... ... ... ...
407 0.070079 0.002232 predictable
408 0.222878 -0.000449 asthmatic
409 -0.093243 0.002429 inevitable
410 -0.027499 -0.001423 cautious
411 0.193112 -0.000559 inclusive

412 rows × 3 columns

t-SNE¶

Reducing the dimension of the FastText word embedding data

In [47]:
from sklearn.manifold import TSNE

def tsneReduction(data):
    return pd.DataFrame(TSNE(n_components=2).fit_transform(data))

tsneData = tsneReduction(wordFastText.iloc[:,1:])

tsneData['words'] = model_ted.wv.index_to_key

tsneData
Out[47]:
0 1 words
0 3.192504 -0.634607 good
1 0.113141 -1.754497 great
2 5.619693 -0.744377 nice
3 -3.572501 -1.655346 little
4 13.541284 0.572302 bad
... ... ... ...
407 17.648628 1.109645 predictable
408 31.769602 0.576736 asthmatic
409 -10.732373 -1.950683 inevitable
410 3.960903 -1.051612 cautious
411 29.460543 0.985976 inclusive

412 rows × 3 columns

Reduction for Bi/trigram FastText data¶

In [48]:
pcaDataGram = pcaReduction(FastTextGram.iloc[:,1:])
pcaDataGram['words'] = FastTextGram['words']

tsneDataGram = tsneReduction(FastTextGram.iloc[:,1:])
tsneDataGram['words'] = FastTextGram['words']

Clustering¶

Clustering the reduced data

In [49]:
from sklearn.cluster import KMeans, MiniBatchKMeans
from yellowbrick.cluster import KElbowVisualizer


# Instantiate the clustering model and visualizer
model = MiniBatchKMeans()
visualizer = KElbowVisualizer(model, k=(2,12), timings=False)

visualizer.fit(pcaData[[0,1]])        # Fit the data to the visualizer
visualizer.show()   
Out[49]:
<Axes: title={'center': 'Distortion Score Elbow for MiniBatchKMeans Clustering'}, xlabel='k', ylabel='distortion score'>

According to the elbow scores, KMeans with 5 clusters gives good results.
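As a quick cross-check of the elbow choice, the silhouette score can be computed over a range of k. A minimal sketch using scikit-learn on synthetic blobs (in the notebook, the input would be the PCA-reduced data instead of the toy X below):

```python
# Cross-checking a choice of k with silhouette scores.
# Sketch on synthetic data; replace X with pcaData[[0, 1]] in the notebook.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1], higher is better

best_k = max(scores, key=scores.get)
```

The k with the highest silhouette score should broadly agree with the elbow point.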

In [50]:
def clustringBow(data, k):
    clusters = KMeans(n_clusters=k)
    clusters.fit(data)
    return clusters.labels_



pcaData['labels'] = clustringBow(wordFastText.iloc[:,1:], 5)
tsneData['labels'] = clustringBow(wordFastText.iloc[:,1:], 5)


pcaDataGram['labels'] = clustringBow(FastTextGram.iloc[:,1:], 5)
tsneDataGram['labels'] = clustringBow(FastTextGram.iloc[:,1:], 5)
In [51]:
fig, ax = plt.subplots(1, 2, figsize=(20,7), tight_layout=True)
sns.scatterplot(data = pcaData, x=0, y=1, hue="labels", ax=ax[0], palette='Set1')
ax[0].set_title('Distribution of words with PCA')
ax[0].set_xlabel('Principal component 1')
ax[0].set_ylabel('Principal component 2')

sns.scatterplot(data = tsneData, x=0, y=1, hue="labels", ax=ax[1], palette='Set1')
ax[1].set_title('Distribution of words with t-SNE')
ax[1].set_xlabel('Component 1')
ax[1].set_ylabel('Component 2')
Out[51]:
Text(0, 0.5, 'Component 2')

Bigram and Trigram dataset¶

In [52]:
fig, ax = plt.subplots(1, 2, figsize=(20,7), tight_layout=True)
sns.scatterplot(data = pcaDataGram, x=0, y=1, hue="labels", ax=ax[0], palette='Set1')
ax[0].set_title('Distribution of words with PCA')
ax[0].set_xlabel('Principal component 1')
ax[0].set_ylabel('Principal component 2')

sns.scatterplot(data = tsneDataGram, x=0, y=1, hue="labels", ax=ax[1], palette='Set1')
ax[1].set_title('Distribution of words with t-SNE')
ax[1].set_xlabel('Component 1')
ax[1].set_ylabel('Component 2')
Out[52]:
Text(0, 0.5, 'Component 2')

Number of words per cluster

In [53]:
fig= plt.figure(figsize=(10,5), tight_layout=True)

plt.subplot(1, 2, 1)
countUnigram = sns.countplot(data=pcaData, x='labels')
for p in countUnigram.patches:
    txt = str(p.get_height())
    txt_x = p.get_x() + 0.3
    txt_y = p.get_height()
    countUnigram.text(txt_x,txt_y,txt, size=14)
plt.ylabel('Number of words')
plt.xlabel('Cluster num')
plt.title('Number of words in each cluster for Unigram')  

plt.subplot(1, 2, 2)
countUnigram = sns.countplot(data=pcaDataGram, x='labels')
for p in countUnigram.patches:
    txt = str(p.get_height())
    txt_x = p.get_x() + 0.3
    txt_y = p.get_height()
    countUnigram.text(txt_x,txt_y,txt, size=14)
plt.ylabel('Number of words')
plt.xlabel('Cluster num')
plt.title('Number of words in each cluster for Bi/trigram') 
Out[53]:
Text(0.5, 1.0, 'Number of words in each cluster for Bi/trigram')

For the unigram data, cluster 0 has the highest number of words, followed by clusters 3 and 1. For the bi/trigram data, cluster 3 has the highest number of words, followed by clusters 1 and 4.

Word distribution per cluster¶

In [54]:
fig, axes = plt.subplots(1,5, figsize=(25,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    cloudClusterWords = " ".join(cat for cat in pcaData[pcaData['labels']==i]['words'])
    plt.gca().imshow(WordCloud(collocations = False, background_color = 'white', width=5000 ,height=7000, colormap='tab20').generate(cloudClusterWords))
    plt.gca().set_title('Topic ' + str(i))
    plt.gca().axis('off')

For the unigram data, topics 1 and 3 contain negative words, which can relate to service, ambiance, food, and restaurant location.

In [55]:
fig, axes = plt.subplots(1,5, figsize=(25,10), sharex=True, sharey=True)


for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    cloudClusterWords = " ".join(cat for cat in pcaDataGram[pcaDataGram['labels']==i]['words'])
    plt.gca().imshow(WordCloud(collocations = False, background_color = 'white', width=5000 ,height=7000, colormap='tab20').generate(cloudClusterWords))
    plt.gca().set_title('Topic ' + str(i))
    plt.gca().axis('off')

For the bi/trigram data, topics 1 and 4 have a negative sentiment, which can relate to service, ambiance, food, and restaurant location.

Clustering with TF-IDF data¶

In [56]:
# Instantiate the clustering model and visualizer
model = MiniBatchKMeans()
visualizer = KElbowVisualizer(model, k=(2,12), timings=False, metric="silhouette")

visualizer.fit(idfVector.toarray())        # Fit the data to the visualizer
visualizer.show()   
Out[56]:
<Axes: title={'center': 'Silhouette Score Elbow for MiniBatchKMeans Clustering'}, xlabel='k', ylabel='silhouette score'>

Clustering the TF-IDF data with 4 clusters and reducing the dimensionality for visualization.

In [57]:
from sklearn.decomposition import KernelPCA
# Clustering and reducing the dimension
idfTsneDf = pd.DataFrame(TSNE(n_components=2).fit_transform(idfVector.toarray()))
idfTsneDf['labels'] = clustringBow(idfVector.toarray(), 4)

X_pca_dim = KernelPCA(n_components=2).fit_transform(idfVector.toarray())
pca_df = pd.DataFrame(dict(x = X_pca_dim[:, 0], y = X_pca_dim[:,1], Cluster = idfTsneDf['labels'] ))
In [58]:
fig, ax = plt.subplots(1, 3, figsize=(20, 7), tight_layout=True)

# First subplot - count of reviews per cluster
sns.countplot(data=idfTsneDf, x='labels', ax=ax[0], palette='Set1')
ax[0].set_title('Distribution of cluster labels for reviews')
ax[0].set_xlabel('Labels')
ax[0].set_ylabel('Number of reviews')

# Second subplot - Scatter plot with t-SNE
sns.scatterplot(data=idfTsneDf, x=0, y=1, hue="labels", ax=ax[1], palette='Set1')
ax[1].set_title('Distribution of words with t-SNE')
ax[1].set_xlabel('Component 1')
ax[1].set_ylabel('Component 2')

# Third subplot - Scatter plot with PCA
sns.scatterplot(x='x', y='y', data=pca_df, hue='Cluster', palette='Set1', ax=ax[2])
ax[2].set_title('Distribution of words with PCA')
ax[2].set_xlabel('Component 1')
ax[2].set_ylabel('Component 2')
Out[58]:
Text(0, 0.5, 'Component 2')

Cluster 0 has the highest number of reviews and is the major cluster.

In [59]:
fig, axes = plt.subplots(2,2, figsize=(20,10), sharex=True, sharey=True)
idfTsneDf['words'] = processedReviewData['text_lemmatized']
for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    cloudClusterWords = " ".join(" ".join(cat) for cat in idfTsneDf[idfTsneDf['labels']==i]['words'])
    plt.gca().imshow(WordCloud(collocations = False, background_color = 'white', colormap='tab20', max_words=50).generate(cloudClusterWords))
    plt.gca().set_title('Topic ' + str(i))
    plt.gca().axis('off')

Topic 3 contains negative or near-negative words.

Topic Modelling with LDA (Latent Dirichlet Allocation)¶

Latent Dirichlet Allocation (LDA) is a probabilistic model that treats every topic as a distribution over words and every document as a mixture of topics, with each word in a document drawn from one of the document's topics with some probability.
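As a toy illustration of these document-topic mixtures (using scikit-learn here for brevity; the cells below use Gensim, and the four-document corpus is made up):

```python
# Each document comes out as a probability distribution over topics.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "burger fries burger shake",
    "pasta pizza wine pasta",
    "burger shake fries soda",
    "wine pizza pasta bread",
]

bow = CountVectorizer().fit_transform(docs)  # bag-of-words counts
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda_toy.fit_transform(bow)

# One row per document; each row is non-negative and sums to 1.
print(doc_topics.round(2))
```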

In [60]:
from gensim.models import ldamodel
from gensim.corpora.dictionary import Dictionary

dictionary = Dictionary(processedReviewData['token_lda'])
corpus_bow = [dictionary.doc2bow(text) for text in processedReviewData['token_lda']]
In [61]:
[[(dictionary[id], freq) for id, freq in cp] for cp in corpus_bow[:1]]
Out[61]:
[[('appear', 1),
  ('beef', 1),
  ('blow', 1),
  ('burger', 4),
  ('crap', 1),
  ('flavor', 1),
  ('focus', 1),
  ('ground', 1),
  ('kroger', 1),
  ('lunch', 1),
  ('meat', 1),
  ('meh', 1),
  ('menu', 1),
  ('patty', 2),
  ('pile', 1),
  ('state', 1),
  ('steam', 1),
  ('water', 1)]]
In [62]:
ldaTopicNum = 15
In [63]:
lda_model = ldamodel.LdaModel(corpus=corpus_bow, # Stream of document vectors or sparse matrix of shape (num_documents, num_terms)
                                id2word=dictionary, # Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
                                num_topics=ldaTopicNum, # The number of requested latent topics to be extracted from the training corpus.
                                passes=10, #Number of passes through the corpus during training
                                per_word_topics=True) # computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length
In [68]:
import warnings
warnings.filterwarnings('ignore')
def worCloudPertopic(model, numTopic, row, col):
    cloudLda = WordCloud(stopwords=stop_words,
                    background_color='white',
                    max_words=100,
                    colormap='tab10')


    fig, axes = plt.subplots(row,col, figsize=(20,15), sharex=True, sharey=True)
    fig.tight_layout()
    for i, ax in enumerate(axes.flatten()):
        fig.add_subplot(ax)
        topic_words = " ".join(x[0] for x in model.show_topics(numTopic,formatted=False, num_words=30)[i][1])
        cloudLda.generate(topic_words)
        plt.gca().imshow(cloudLda)
        plt.gca().set_title('Topic ' + str(i))
        plt.gca().axis('off')


    plt.subplots_adjust(wspace=0, hspace=0)
    plt.axis('off')
    plt.margins(x=0, y=0)
    plt.tight_layout()
    plt.show()

worCloudPertopic(lda_model, ldaTopicNum, 5, 3)

Summarizing Topics¶

In [70]:
for i in range(0,ldaTopicNum):
    print("Topic : ", i)
    print(" ".join(x[0] for x in lda_model.show_topics(ldaTopicNum,formatted=False, num_words=10)[i][1]))
Topic :  0
flower okra surgery steakhouse designer penny bone label tahoe watermelon
Topic :  1
nail salon pedicure job gel color manicure look polish come
Topic :  2
order fry cheese burger chicken sandwich sauce try like potato
Topic :  3
store dress love wedding class selection shop buy staff room
Topic :  4
park parking museum tree lot sunset winery game gras drive
Topic :  5
food service place come drink order wait table time restaurant
Topic :  6
ice cream dog bagel chocolate place staff love shop dr
Topic :  7
sushi roll breakfast coffee egg santa brunch donut barbara salmon
Topic :  8
pizza food place order try service love lunch time restaurant
Topic :  9
car service work day time need tell customer come guy
Topic :  10
sauce pork chicken shrimp bbq dish salad rib fish try
Topic :  11
food like place beer time wine meat try love taste
Topic :  12
room hotel stay staff lot area tour location airport walk
Topic :  13
time ask tell like come place want look order service
Topic :  14
place taco like food eat salsa try love chip mexican

LDA Sklearn library¶

In [71]:
from sklearn.decomposition import LatentDirichletAllocation
feature_names =  vectorizer.get_feature_names_out() 

lda = LatentDirichletAllocation(
        n_components=ldaTopicNum,
        max_iter=20
)
lda.fit_transform(idfVector.toarray())
Out[71]:
array([[0.39725817, 0.02822674, 0.02822674, ..., 0.02822674, 0.23579419,
        0.02822674],
       [0.02452442, 0.02452443, 0.02452465, ..., 0.02452443, 0.02452444,
        0.02452442],
       [0.02456825, 0.02456825, 0.02456825, ..., 0.02456825, 0.02456825,
        0.02456825],
       ...,
       [0.02487234, 0.02487233, 0.20476788, ..., 0.02487235, 0.02487233,
        0.02487233],
       [0.01837954, 0.01837954, 0.01837953, ..., 0.22469078, 0.01837953,
        0.01837954],
       [0.02273261, 0.02273253, 0.02273251, ..., 0.02273251, 0.02273253,
        0.02273252]])
In [77]:
def plot_top_words(model, feature_names, n_top_words, title, row, col):
    # Adapted from a scikit-learn documentation example
    fig, axes = plt.subplots(row,col, figsize=(20, 20))
    fig.tight_layout()
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.6, color="#7451eb")
        ax.set_title(f'Topic {topic_idx +1}',
                     fontdict={'fontsize': 13})
        ax.invert_yaxis()
        ax.tick_params(axis='both', which='major', labelsize=16)
        for i in 'top right left'.split():
            ax.spines[i].set_visible(False)
        fig.suptitle(title, fontsize=15)
        ax.tick_params(bottom=False)
        ax.set(xticklabels=[])

    plt.subplots_adjust(top=0.93, bottom=0.02, wspace=0.6, hspace=0.14)
    plt.show()

plot_top_words(lda, feature_names, 10,'Topics in LDA', 5,3)
In [78]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic {}:".format(topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))


display_topics(lda, feature_names, 10)
Topic 0:
small easy second good great lucky solid willing particular unprofessional
Topic 1:
large local good incredible great poor professional available affordable classic
Topic 2:
bad special extra good personal courteous basic possible new nice
Topic 3:
hot great good live entire comfortable little true satisfied traditional
Topic 4:
great good little fresh high impressed flat big organic numerous
Topic 5:
busy free french open good soft difficult tiny pleased ridiculous
Topic 6:
low single good chinese normal negative main strong important great
Topic 7:
fantastic hard knowledgeable good able great fabulous vegetarian exceptional real
Topic 8:
happy new black healthy great good little gross natural nice
Topic 9:
reasonable old expensive good great white typical quiet casual nice
Topic 10:
ready regular average good original safe tough bad additional enjoyable
Topic 11:
disappointed wrong attentive good outstanding green complete major great personable
Topic 12:
nice terrible short good huge authentic great big similar dead
Topic 13:
different italian good social specific necessary small new broad upper
Topic 14:
delicious horrible great good overall fresh red daily little bad

Non-Negative Matrix Factorization (NMF)¶
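NMF factorizes the non-negative document-term matrix V into two non-negative factors, W (document-topic weights) and H (topic-term weights), so that V ≈ WH. A minimal sketch with scikit-learn on a made-up count matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative document-term matrix (rows = documents, cols = terms).
V = np.array([[3, 1, 0, 0],
              [2, 2, 0, 1],
              [0, 0, 4, 2],
              [0, 1, 3, 3]], dtype=float)

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(V)   # document-topic weights
H = model.components_        # topic-term weights

# W and H are non-negative, and W @ H approximates V.
reconstruction_error = np.abs(V - W @ H).mean()
```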

In [79]:
nfmTopicNumber = 15
In [81]:
from gensim.models import Nmf

nfm_model = Nmf(corpus=corpus_bow, # Stream of document vectors or sparse matrix of shape (num_documents, num_terms)
                                id2word=dictionary, # Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
                                num_topics=nfmTopicNumber, # The number of requested latent topics to be extracted from the training corpus.
                                passes=10) #Number of passes through the corpus during training
In [82]:
worCloudPertopic(nfm_model, nfmTopicNumber, 5, 3)
In [83]:
for i in range(0,nfmTopicNumber):
    print("Topic : ", i)
    print(" ".join(x[0] for x in nfm_model.show_topics(nfmTopicNumber,formatted=False, num_words=10)[i][1]))
Topic :  0
pizza experience recommend visit staff love year restaurant work price
Topic :  1
order chicken fry sauce come sandwich cheese dish eat salad
Topic :  2
hair appointment ask salon love cut animal time zoo leave
Topic :  3
know like taco fish want review price thing work business
Topic :  4
look like dress try feel nail walk way decide seat
Topic :  5
wait service minute ask order customer sit bar restaurant seat
Topic :  6
car work tell day alignment drive need hour pay pm
Topic :  7
ice cream like flavor roll chocolate love cake try sauce
Topic :  8
beer bar try selection time night drink cheese burger menu
Topic :  9
service time customer come try place food location star receive
Topic :  10
food people like taste chicken day love location review thing
Topic :  11
room hotel stay time area people breakfast work lot staff
Topic :  12
food table restaurant drink eat menu dinner come server meal
Topic :  13
place love burger try recommend sandwich drink order want price
Topic :  14
come tell day ask time want work check phone pay

NMF Sklearn library¶

In [84]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=nfmTopicNumber , max_iter=20)
nmf.fit_transform(idfVector.toarray())
Out[84]:
array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.24991929e-04, 4.19272097e-04, 4.06676911e-03, ...,
        1.44567139e-02, 0.00000000e+00, 1.10224114e-02],
       [5.75158917e-04, 1.19162748e-03, 0.00000000e+00, ...,
        0.00000000e+00, 1.75829363e-01, 0.00000000e+00],
       ...,
       [1.09736133e-04, 0.00000000e+00, 2.37331655e-04, ...,
        1.56027879e-01, 0.00000000e+00, 2.43606229e-03],
       [1.34688251e-03, 4.15642859e-02, 5.60640331e-02, ...,
        0.00000000e+00, 5.24868343e-04, 3.46820596e-04],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        1.40661755e-01, 1.09685396e-03, 0.00000000e+00]])
In [85]:
plot_top_words(nmf, feature_names, 10, 'Topics in NMF', 5, 3)
In [86]:
display_topics(nmf, feature_names, 10)
Topic 0:
good overall disappointed attentive french second local expensive old open
Topic 1:
great reasonable overall local easy big knowledgeable huge old high
Topic 2:
nice special overall big free local open reasonable black able
Topic 3:
little special able tiny average high red free hard big
Topic 4:
bad horrible terrible old open wrong poor entire hard low
Topic 5:
fresh huge local old big healthy high soft able available
Topic 6:
new old ready open free professional big special hard local
Topic 7:
small high open local big easy regular overall hard authentic
Topic 8:
hot huge special big ready extra easy chinese main regular
Topic 9:
happy old able reasonable attentive outstanding big special free disappointed
Topic 10:
different big second disappointed huge able high terrible extra available
Topic 11:
delicious special extra healthy free authentic attentive impressed green chinese
Topic 12:
busy big attentive ready free horrible old able real poor
Topic 13:
fantastic old big easy personable knowledgeable free short able high
Topic 14:
large huge open local special high able entire outstanding comfortable
In [87]:
processedReviewData['ldaGemsimTopics'] = processedReviewData['token_lda'].apply(lambda x : lda_model.get_document_topics(dictionary.doc2bow(x), minimum_probability=0.1))
processedReviewData['clusterLabels'] = idfTsneDf['labels']
processedReviewData['nfmGemsimTopics'] = processedReviewData['token_lda'].apply(lambda x : nfm_model.get_document_topics(dictionary.doc2bow(x), minimum_probability=0.1))
processedReviewData[['text','ldaGemsimTopics', 'clusterLabels', 'nfmGemsimTopics']]
Out[87]:
text ldaGemsimTopics clusterLabels nfmGemsimTopics
0 Went for lunch and found that my burger was me... [(2, 0.2765501), (8, 0.12674528), (11, 0.20354... 0 [(1, 0.23430688282735607), (7, 0.1013745352596...
1 I needed a new tires for my wife's car. They h... [(9, 0.72247756), (13, 0.21561399)] 0 [(1, 0.30649012717452373), (6, 0.6057575442527...
2 Jim Woltman who works at Goleta Honda is 5 sta... [(1, 0.9377559)] 0 [(0, 0.3981311405112851), (3, 0.13720122666155...
3 Been here a few times to get some shrimp. The... [(11, 0.21335638), (14, 0.690344)] 2 [(3, 0.2126430775025083), (8, 0.16414513683691...
4 This is one fantastic place to eat whether you... [(8, 0.23385039), (11, 0.67947745)] 3 [(0, 0.21412922148957647), (13, 0.675145364135...
... ... ... ... ...
4994 I am not sure what to think of this place. I b... [(8, 0.11680629), (9, 0.16956975), (13, 0.6777... 0 [(2, 0.1954964385554125), (9, 0.22997305655587...
4995 I'm so excited to see the red Robin had re-ope... [(2, 0.21591565), (5, 0.7064889)] 0 [(1, 0.12841763016219607), (5, 0.5821628352784...
4996 This is our go-to pizza place! We love their ... [(8, 0.6359706), (13, 0.25112677)] 0 [(0, 0.6627956410121099), (10, 0.1427711627113...
4997 This is located in a great spot fairly close t... [(4, 0.123830244), (7, 0.20269445), (12, 0.312... 0 [(3, 0.20750610054368798), (4, 0.1515471170643...
4998 I went in for a sirloin burger and a salad. Th... [(2, 0.2131707), (3, 0.12926008), (11, 0.20708... 0 [(1, 0.2092417280021048), (8, 0.18271243524895...

4999 rows × 4 columns

In [88]:
df_ldaVis = pd.DataFrame([val for sublist in processedReviewData['ldaGemsimTopics'] for val in sublist])
df_nfmVis = pd.DataFrame([val for sublist in processedReviewData['nfmGemsimTopics'] for val in sublist])

fig, ax = plt.subplots(1, 3, figsize=(20,7), tight_layout=True)
sns.kdeplot(data = df_ldaVis,x=1, hue=0,  ax=ax[0], palette='Set1')
ax[0].set_title('Distribution of Topics ratio for the reviews (LDA)')
ax[0].set_xlabel('Percentage of Topic present')
ax[0].set_ylabel('Reviews')

sns.kdeplot(data = processedReviewData, x='clusterLabels', ax=ax[1], palette='Set1')
ax[1].set_title('Distribution of Clusters over reviews')
# ax[1].set_xlabel('Labels')
# ax[1].set_ylabel('Component 2')

sns.kdeplot(data = df_nfmVis,x=1, hue=0, ax=ax[2], palette="Set1")
ax[2].set_title('Distribution of Topics ratio for the reviews (NMF)')
ax[2].set_xlabel('Percentage of Topic present')
ax[2].set_ylabel('Reviews')
Out[88]:
Text(0, 0.5, 'Reviews')

Conclusion¶

KMeans clustering: based on the word-cloud representation of each cluster, we can label the groups by summarizing the sense of the words present in the respective clusters.

  • Cluster 0: Positive
  • Cluster 1: Positive
  • Cluster 2: Positive
  • Cluster 3: Negative
| Topics | LDA topics (Gensim model) | NMF topics (Gensim model) |
| --- | --- | --- |
| Topic 0 | food order | car service |
| Topic 1 | car, time and day, pay | Animal-friendly place |
| Topic 2 | food order wait time | Fast food |
| Topic 3 | Ambiance, surrounding | Delivery |
| Topic 4 | ice cream place | Nice Interior |
| Topic 5 | sushi, tour guide | Bar |
| Topic 6 | food service | Services |
| Topic 7 | Mexican food | Salon |
| Topic 8 | delivery | City tour |
| Topic 9 | Hotel | Restaurant |
| Topic 10 | Polish | Fast food |
| Topic 11 | Customer service | ice cream, Job |
| Topic 12 | Shopping place | service |
| Topic 13 | Art | hotel, resort |
| Topic 14 | Salon | pizza store |